Faster solutions of the inverse pairwise Ising problem 
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Recent work has shown that probabiUstic models based on pairwise interactions — in the simplest 
case, the Ising model — provide surprisingly accurate descriptions of experiments on real biological 
networks ranging from neurons to genes. Finding these models requires us to solve an inverse prob- 
lem: given experimentally measured expectation values, what are the parameters of the underlying 
Hamiltonian? This problem sits at the intersection of statistical physics and machine learning, and 
we suggest that more efficient solutions are possible by merging ideas from the two fields. We use 
a combination of recent coordinate descent algorithms with an adaptation of the histogram Monte 
Carlo method, and implement these techniques to take advantage of the sparseness found in data on 
real neurons. The resulting algorithm learns the parameters of an Ising model describing a network 
of forty neurons within a few minutes. This opens the possibility of analyzing much larger data sets 
now emerging, and thus testing hypotheses about the collective behaviors of these networks. 



I. INTRODUCTION 

In the standard problems of statistical mechanics, we 
begin with the definition of the Hamiltonian and pro- 
ceed to calculate the expectation values or correlation 
functions of various observable quantities. In the in- 
verse problem, we are given the expectation values and 
try to infer the underlying Hamiltonian. The history of 
the inverse problems goes back (at least) to 1959, when 
Keller and Zumino [T] showed that, for a classical gas, the 
temperature dependence of the second virial coefficient 
determines the interaction potential between molecules 
uniquely, provided that this potential is monotonic. Sub- 
sequent work on classical gases and fluids considered the 
connection between pair correlation functions and inter- 
action potentials in various approximations ^ , and more 
rigorous constructions of Boltzmann distributions consis- 
tent with given spatial variations in density [3J or higher 
order correlation functions [4 . 

In fact the inverse problem of statistical mechanics 
arises in many different contexts, with several largely in- 
dependent literatures. In computer science, there are a 
number of problems where we try to learn the probabil- 
ity distribution that describes the observed correlations 
among a large set of variables in terms of some (hope- 
fully) simpler set of interactions. Many algorithms for 
solving these learning problems rest on simplifications 
or approximations that correspond quite closely to es- 
tablished approximation methods in statistical mechan- 
ics [5J. More explicitly, in the context of neural network 
models [6, ^7^, a family of models referred to as 'Boltz- 
mann machines' lean directly on the mapping of proba- 
bilistic models into statistical physics problems, identi- 
fying the parameters of the probabilistic model with the 
coupling constants in an Ising-like Hamiltonian [5]. 

Inverse problems in statistical mechanics have received 
new attention because of attempts to construct explicit 
network models of biological systems. Physicists have 
long hoped that the collective behavior which emerges 



from statistical mechanics could provide a model for the 
emergence of function in biological systems, and this gen- 
eral class of ideas has been explored most fully for net- 
works of neurons [B] [7]. Recent work shows how these 
ideas can be linked much more directly to experiment 
O llOj by searching for maximum entropy models that 
capture some limited set of measured correlations. At a 
practical level, implementing this program requires us to 
solve a class of inverse problems for Ising models with 
pairwise interactions among the spins, and this is the 
problem that we consider here. 

To be concrete, we consider a network of neurons. 
Throughout the brain, neurons communicate by gener- 
ating discrete, identical electrical pulses termed action 
potentials or spikes. If we look in a small window of 
time, each neuron either generates a spike or it does not, 
so that there is a natural description of the instantaneous 
state of the network by a collection of binary or Ising vari- 
ables; a-i — +1 indicates that neuron i generates a spike, 
and (Ji = — 1 indicates that neuron i is silent. Knowing 
the average rate at which spikes are generated by each 
cell is equivalent to knowing the expectation values (di) 
for all i. Similarly, knowing the probabilities of coincident 
spiking (correlations) among all pairs of neurons is equiv- 
alent to knowing the expectation values (cTifTj). Of course 
there are an infinite number of probability distributions 
P((t) over the states of the whole system (cr = {ffi}) 
that are consistent with these expectation values, but if 
we ask for the distribution that is as random as possible 
while still reproducing the data — the maximum entropy 
distribution — then this has the form of an Ising model 
with pairwise interactions: 
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The inverse problem is to find the "magnetic fields" {hi} 
and "exchange interactions" {Jij} that reproduce the ob- 
served values of (ai) and (cTicrj). 
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The surprising result of Ref [S] was that this Ising 
model provides an accurate quantitative description of 
the combinatorial patterns of spiking and silence ob- 
served in groups of order A'^ = 10 neurons in the retina as 
it responds to natural sensory inputs, despite only taking 
account of pairwise interactions. The Ising model allows 
us to understand how, in this system, weak correlations 
among pairs of neurons can coexist with strong collective 
effects at the network level, and this is even clearer as one 
extends the analysis to larger groups (using real data for 
N = 40, and extrapolating to N — 120), where there 
is a hint that the system is poised near a critical point 
in its dynamics [lOj . Since the initial results, a number 
of groups have found that the maximum entropy models 
provide surprisingly accurate descriptions of other neu- 
ral systems ^I] [T^] |T3] and similar approaches have been 
used to look at biochemical and genetic networks [HI [15] . 

The promise of the maximum entropy approach to bio- 
logical networks is that it builds a bridge from easily ob- 
servable correlations among pairs of elements to a global 
view of the collective behavior that can emerge from the 
network as a whole. Clearly this potential is greatest 
in the context of large networks. Indeed, even for the 
retina, methods are emerging that make it possible to 
record simultaneously from hundreds of neurons [TBI [Tf] . 
so just keeping up with the data will require methods 
to deal with much larger instances of the inverse prob- 
lem. The essential difficulty, of course, is that once we 
have a large network, even checking that a given set of 
parameters {/ij, Jij} reproduce the observed expectation 
values requires a difficult calculation. In Ref [10] we took 
an essentially brute force Monte Carlo approach to this 
part of the problem, and then adjusted the parameters 
to improve the match between observed and predicted 
expectation values using a relatively naive algorithm. 

In this work we combine several ideas — taken both 
from statistical physics and from machine learning |18| — 
which seem likely to help arrive at more efficient solutions 
of the inverse problem for the pairwise Ising model. First, 
we adapt the histogram Monte Carlo method [19] to 're- 
cycle' the Monte Carlo samples that we generate as we 
make small changes in the parameters of the Hamilto- 
nian. Second, we use a coordinate descent method to 
adjust the parameters [20]. Finally, we exploit the fact 
that neurons use their binary states in a very asymmetric 
fashion, so that silence is much more common that spik- 
ing. Combining these techniques, we are able to solve the 
inverse problem for = 40 neurons in tens of minutes, 
rather than many days for the naive approach, holding 
out hope for generalization to yet larger problems. 



II. INGREDIENTS OF OUR ALGORITHM 

A. Basic formulation 

Our overall goal is to build a model for the distribu- 
tion P((t) over the states of the a system with N ele- 



ments, a = {(Ti, (72, cr^v}. As ingredients for deter- 
mining this model, we use low order statistics computed 
from a set of m samples {cr^, cr^, • • • , cr™}, which we can 
think of as samples drawn from the distribution P(cr). 
The classical idea of maximum entropy models is that 
we should construct P{a) to generate the correct values 
of certain average quantities (e.g., the energy in the case 
of the Boltzmann distribution), but otherwise the distri- 
bution should be 'as random' as possible [5T]. Formally 
this means that we find -P(cr) as the solution of a con- 
strained optimization problem, maximizing the entropy 
of the distribution subject to conditions that enforce the 
correct expectation values. We will refer to the quanti- 
ties whose averages are constrained as "features" of the 
system, f = {/i, /2, ■ ■ • , /a:}, where each is a function 
of the state cr, f^{<j). 

One special set of average features are just the 
marginal distributions for subsets of the variables. Thus 
we can construct the one-body marginals 

^i(^i)= ^(^i,^2,---,M, (2) 

the two-body marginals, 

-Pij(^i'f^j) = X! -P(c^i,cr2, • • ■ ,crjv), (3) 

and so on for larger subsets. The maximum entropy 
distributions consistent with marginal distributions up 
to K-hody terms generates a hierarchy of models that 
capture increasingly higher-order correlations, monoton- 
ically reducing the entropy of the model as K increases, 
toward the true value [^ . 

Let P denote the empirical distribution 
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P{cr)=-^5{a,a^), 

n=l 



(4) 



where d{a, cr') is the Kronecker delta, equal to one when 
a = g' and equal to zero otherwise. The maximum- 
entropy problem is then 

maxp S\P\ such that (f (ct)) p = (f (ct)) p , (5) 

where S[P] denotes the entropy of the distribution P, and 
(• • •)p denotes an expectation value with respect to that 
distribution. Using the method of Lagrange multipliers, 
the solution to the maximum entropy problem has the 
form of a Boltzmann or Gibbs distribution Hll , 
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(6) 



where as usual the partition function 2(6*) is the normal- 
ization constant ensuring that the distribution Qg sums 
to one, and the parameters {^i, ^2, • • • , ^k} = G 
correspond to the Lagrange multipliers. Note that the ex- 
pression for Qg above describes the pairwise Ising model 



K 



3 



Eq ([T]), with 9 = {hi,Jij}, and the features f are the 
one-spin and two-spin combinations {crj, iTicrj}. 

Rather than thinking of our problem as that of maxi- 
mizing the entropy subject to constraints on the expec- 
tation values, we can now think of our task as searching 
the space of Gibbs distributions, parameterized as in Eq 
^ , to find the values of the parameters 9 that generate 
the correct expectation values. Importantly, because of 
basic thermodynamic relationships, this search can also 
be formulated as an optimization problem. Specifically, 
we recall that expectation values in statistical mechanics 
can be written as derivatives of the free energy, which in 
turn is the logarithm of the partition function (up to fac- 
tors of the temperature, which isn't relevant here). Thus, 



for distribution in the form of Eq ^ , we have 

a In 2(6*) 



d9„ 



(7) 



Enforcing that these expectation values are equal to the 
expectation values computed from our empirical samples 
means solving the equation 
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But this can be written as 
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Thus we see that matching the empirical expectation val- 
ues is equivalent to looking for a local extremum (which 
turns out to be a maximum) of the quantity 



Comparing with Eq ( 12 1, we obtain the dual formulation 
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But if all the samples {cr^, (T^, • • • , cr"} are drawn in- 
dependently, then the total probability that the Gibbs 
distribution with parameters 9 generates the data is 
-Ptotai = n„<3e((^"), and so the quantity in Eq ([l2| is 
(up to a factor of m), just InPtotai- Finding the maxi- 
mum entropy distribution thus is equivalent to maximiz- 
ing the probability or likelihood that our model generates 
the observed samples, within the class of models defined 
by the Gibbs distribution Eq 

We recall that, in information theory probabil- 
ity distributions implicitly define strategies for encoding 
data, and the shortest codes are achieved when our model 
of the distribution actually matches the distribution from 
which the data are drawn. Since code lengths are related 
to the negative logarithm of probabilities, it is convenient 
to define the cost of coding or log loss Lp(0) that arises 
when we use the model with parameters 9 to describe 
data drawn from the empirical distribution P: 



^ m 

-VlnQe(fT") = (-lnQe(fT))j 



(13) 



of the maximum entropy problem, 
mine Lp(^)- 



(14) 



Why is the optimization problem in Eq ( 14 1 difficult? 
In principle, the convexity properties of free energies 
should make the problem well behaved and tractable. 
But it remains possible for Lp(0) to have a very sensi- 
tive dependence on the parameters 9, and this can cause 
practical problems, especially if, as suggested in Ref [10], 
the systems we want to describe are poised near a critical 
point. Even before encountering this problem, however, 
we face the difficulty that computing ^p{9) or even its 
gradient in parameter space involves computing expecta- 
tion values with respect to the distribution Qg(a). Once 
the space of states becomes large, it is no longer possi- 
ble to do this by exact enumeration. We can try to use 
approximate analytical methods, or we can use Monte 
Carlo methods. 



B. Monte Carlo methods 

Our general strategy for solving the optimization prob- 
lem in Eq ( [l4| will be to use standard Monte Carlo simu- 
lations [24] [251 to generate samples from the distribu- 
tion Qg{cr), approximate the relevant expectation values 
as averages over these samples, and then use the results 
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to propose changes in the parameters so as to proceed 
toward a minimum of Lp(0). Implemented naively, as in 
Ref [TU], this procedure is hugely expensive, because at 
each new setting of the parameters we have to generate a 
new set of Monte Carlo samples. Some of this cost can be 
avoided using the ideas of histogram Monte Carlo [19] . 

We recall that if we want to compute the expectation 
value of some function <&(cr) in the distribution Qg'{cr), 
this can be written as 



<i>(a) 



\ Qeia) 

(<i>{a)exp[-{e'-ey{{a)])e 
(exp[-(0'-0).f(a)])e 



(15) 
(16) 

(17) 
(18) 



where we denote expectation values in the distribution 
Qeia) by (• • ■)g, and similarly for 6'. We note that Eq 
( 18 1 is exact. The essential step of histogram Monte 
Carlo is to use an approximation to this equation, re- 
placing the expectation value in the distribution Qg by 
an average over a set of samples drawn from Monte Carlo 
simulation of this distribution. 

Consider an algorithm A which searches for a mini- 
mum of Lp{9). As this algorithm progresses, the values 
of the parameters 9 change slowly. We will divide these 
changes into stages s = 1, 2, • • •, and within each stage we 
will perform t = 1, 2, • • • , T iterations. At the first itera- 
tion, we will generate, via Monte Carlo, M samples of the 
state a drawn out of the distribution appropriate to the 
current value of 9 — 6{s,t — 1). Let us refer to averages 
over these samples as {■■■)mc0j which should approxi- 
mate {■■■)g- At subsequent iterations, the parameters 
9 will be adjusted (see below for details), but we keep 
the same Monte Carlo samples and approximate the ex- 
pectation values over the distribution with parameters 
9' = e{s,t) as 
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(19) 



C. Parameter adjustment and the use of sparsity 

At the core of our problem is an algorithm which tries 
to adjust the parameters 9 so as to minimize hp{9). In 
Ref we used a simple gradient descent method. Here 
we use a coordinate descent method [20], adapted from 
Ref [27] and tuned specifically for the maximum entropy 
problem. Where gradient descent methods attempt to 
find a vector in parameter space along which one can 
achieve the maximum reduction in L, coordinate descent 
methods explore in sequence the individual parameters 

Beyond the general considerations in Refs [501 127] , co- 
ordinate descent may be especially well suited to our 
problem because of the the sparsity of the feature vectors. 
In implementing any parameter adjustment algorithm, it 
is useful to take account of the fact that in networks of 
real neurons, spikes and silences are used very asymmet- 
rically. It thus is useful to write the basic variables as 
rii — (di + l)/2 e {0, 1}. This involves a slight redefini- 
tion of the parameters {/li, Jy}, but once this is done one 
finds that individual terms cx rij or cx nirij are often zero, 
because spikes (corresponding to rti — 1) are rare. This is 
true not just in the experimental data, but of course also 
in the Monte Carlo simulations once we have parameters 
that approximately reproduce the overall probability of 
spiking vs. silence. This sparsity can lead to substantial 
time savings, since all expectation values can be eval- 
uated in time proportional to the number of non-zero 
elements [28] . 

As an alternative to coordinate descent method, we 
also used a general purpose convex optimization algo- 
rithm known as the limited memory variable metric 
(LMVM) [50], which is known to outperform others such 
general purpose algorithms on a wide variety of prob- 
lems [30]. While LMVM has the advantage of changing 
multiple parameters at a time, the cost of updating may 
outweigh this advantage, especially for very sparse data 
such as ours. 

Both the coordinate descent and the LMVM schemes 
are initialized with parameters [in the notation of Eq ([l])] 
Jij = and the hi chosen to exactly reproduce the ex- 
pectation values ((Ti). Our Monte Carlo method follows 
the discussion of the Gibbs sampler in Ref [26) . All com- 
putations were on the same computing cluster |31| used 
in Ref [TU]. 



We denote this approximation as {■ ■ ■)g' ~ {■ ■ ■)gr^g. Once 
we reach i = T, we run a new Monte Carlo simulation 
appropriate to the current value of 9, and the cycle begins 
again at stage s = s + 1. This strategy is summarized as 
pseudocode in Fig[T] 

Once we chose the optimization algorithm A, there are 
still two free parameters in our scheme: the number of 
samples M provided by the Monte Carlo simulation at 
each stage, and the number of iterations per stage T. In 
our experiments, we explore how choices of these parame- 
ters influence the convergence of the resulting algorithms. 



III. EXPERIMENTS 

In this section, we evaluate the speed of convergence 
of our algorithms as a function of M (the number of 
Monte Carlo samples drawn in a sampling round) and 
T (the number of learning steps per stage) under a fixed 
time budget. We begin with synthetic data, for which we 
know the correct answer, and then consider the problem 
of constructing maximum entropy models to describe real 
data. 
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Input: empirical observations ,■ ■ ■ , a'" 
parameters M and T 
access to maxent inference algorithm A 
Algorithm: 

initialize A 

9 (1,1) <— initial parameter vector computed by A 
for s = 1,2,... 

generate M Monte Carlo samples using 6{s, 1) 

fort = 1,...,T 

run one iteration of A, approximating {■ ■ ■)e{3,t) = {■ • ■)9(s,t)|e(s,i) 
6{s, t + 1) <— parameter vector computed by A on this iteration 
e{s + l,l)^e{s,T+l) 



FIG. 1: Pseudocode for our scheme, which reuses a Monte Carlo sample set across T iterations of a maxent algorithm A. 



A. Synthetic data 

The maximum entropy construction is a strategy for 
simplifying our description of a system with many inter- 
acting elements. Separate from the algorithmic question 
of finding the maximum entropy model is the natural sci- 
ence question of whether this model provides a good de- 
scription of the system we are studying, making success- 
ful predictions beyond the set of low order correlations 
that are used to construct the model. To sharpen our 
focus on the algorithmic problem, we use as "empirical 
data" a set of samples generated by Monte Carlo simu- 
lation of an Ising model as in Eq (jlj . To get as close as 
possible to the problems we face in analyzing real data, 
we use the parameters of the Ising model found in Ref f lOj 
in the analysis of activity in a network of iV = 40 neurons 
in the retina as it responds to naturalistic movies. We 
note that this model has competing interactions, as in a 
spin glass, with multiple locally stable states; extrapola- 
tion to larger systems suggests that the parameters are 
close to a critical point. 

Starting with the parameters = {^ii "^ij} determined 
in Ref [TU], we generate to = 2 x 10^ samples by Monte 
Carlo simulation. Let us call the empirical distribution 
over these samples P. Then we proceed to minimize 
Lp{6). To monitor the progress of the algorithm, we 
compute 



AL(0) = Lp(0)-Lp(0o). 



(20) 



Note that this computation requires us to estimate the 
log ratios of partition functions, for which we again use 
the histogram Monte Carlo approach. 



m 

Z{0o) 



(exp[-((?-0o)-fM])eo 



(21) 



(22) 



MCOa 



As a check on this approximation, we can exchange the 
roles of 6 and 6*0, 



Z(go) 

m 



exp[+(^^-0o)-f(T) 



(23) 



and test for consistency between the two results. Finally, 
to compare the performance of the two optimization al- 
gorithms, we withhold some fraction of the data, here 
chosen to be 10%, for testing. 

Figure |2] illustrates the performance of the two algo- 
rithms on the synthetic dataset, measured by AL. for 
simplicity, we have held the number of Monte Carlo sam- 
ples at each stage, M = 3.2 x 10^, fixed. Choosing T = 1 
iteration per stage corresponds to a naive use of Monte 
Carlo, with a new simulation for each new setting of the 
parameters, and we see that this approach converges very 
slowly, as found in Ref |10j . Simply increasing this num- 
ber to T = 20 produces at least an order of magnitude 
speed up in convergence. 



Coordinate update; 1 iteration per stage 
LMVM; 1 iteration per stage 

■ Coordinate update; 20 iterations per stage 

■ LMVM; 20 iterations per stage 




Mce 



200 300 400 

running time (sec) 



FIG. 2: Approximate difference in the test log loss between 
the current solution of an algorithm and the exact model that 
generates the test data, as a function of the algorithm running 
time. The number of Monte Carlo samples at each stage is set 
to 320,000. The two choices of optimization algorithm exhibit 
similar performance, but much greater efficiency is achieved 
in both cases by including multiple iterations of learning al- 
gorithm A per stage. 
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FIG. 3: (Left) Performance of our algorithm on the empirical data according to the absolute correlation difference metric, as 
a function of the number of iterations of the optimization algorithm and the number of Monte Carlo samples. The algorithm 
was terminated after the running time exceeded 300 seconds. Choosing too many or too few iterations per stage decreases 
efficiency (due to overlearning or inefficient use of samples, respectively). (Right) Performance of our algorithm on the empirical 
data according to the absolute correlation difi'erence metric, as a function of the number of examples generated by the Monte 
Carlo sampler. The algorithm was terminated after the running time exceeded 300 seconds. Also displayed is the total number 
of iterations completed by the time cutoff. Fewer iterations per sample allow the generation of more samples, which is a 
computationally expensive process; therefore, fewer iterations result in fewer total optimization stages. Choosing too many or 
too few Monte Carlo samples per stage decreases efficiency, either because the algorithm samples for too long and terminates 
before enough learning stages can be completed, or because the sampling is too sparse and expectations are poorly estimated. 



B. Real data 

Here we consider the problem of constructing the max- 
imum entropy distribution consistent with the correla- 
tions observed in real data. Our example is from Refs 
[HI HO], the responses of = 40 retinal neurons. As 
noted at the outset, if we divide time into small slices of 
duration Ar then each cell either does or does not gen- 
erate an action potential, corresponding to a binary or 
Ising variable. The experiments of Ref [9], analyzed with 
Ar = 0.02 s, provide m = 189,950 samples of the state 
a. We note that spikes are rare, and pair wise correlations 
are weak, as discussed at length in Ref [H]. 

As we try to find the parameters 9 that describe these 
data, we need a metric to monitor the progress of our 
algorithm; of course we don't have access to the true dis- 
tribution. Since we are searching in the space of Gibbs 
distributions which automatically have correct functional 
form to maximize the entropy, we check the quality of 
agreement between the predicted and observed expec- 
tation values. It is straightforward to insure that the 
model reproduces the one-body averages (cti) almost ex- 
actly. Thus we focus on the (connected) two-body aver- 
ages Cij = (cTiCTj) — (cri)((Tj). We compute these averages 
via Monte Carlo in the distribution Qg (a) , and then form 
the absolute difference between computed and observed 
values of Cij , and finally average over all pairs ij . The re- 
sulting average error AC measures how close we come to 
solving the maximum entropy problem. Note that since 
the observed values of Ci are based on a finite set of 



samples, we don't expect that they are exact, and hence 
we shouldn't identify convergence with AC 0. Instead 
we divide the real data in half and measure AC between 
these halves, and use this as the 'finish line' for our algo- 
rithms. 

In Fig|3]we see the behavior of AC as a function of the 
number of iterations per stage (T) , where we terminate 
the algorithm after 300 seconds of running time on our 
local cluster [21]. As expected intuitively, both small 
T and large T perform badly, but there is a wide range 
T ~ 30—200 in which the algorithm reaches the finish line 
within the alloted time. This performance also depends 
on M, the number of Monte Carlo samples per stage, 
so that again there is a range of Af '--^ 1 — 3 x 10^ that 
seems to work well. In effect, when we constrain the 
total run time of the algorithm there is a best way of 
apportioning this time between stages (in which we run 
new Monte Carlo simulations) and iterations (in which 
we adjust parameters using estimates based on a fixed 
set of Monte Carlo samples) . By working within a factor 
of two of this optimum, we achieve substantial speed up 
of the algorithm, or an improvement in convergence at 
fixed run time. 



IV. CONCLUSION 

Our central conclusion is that recycling of Monte Carlo 
samples, as in the histogram Monte Carlo method |19j . 
provides a substantial speedup in the solution of the in- 
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verse Ising problem. In detail, some degree of recycling 
speeds up the computations, while of course too much 
recycling renders the underlying approximations invalid, 
so that there is some optimal amount of recycling; for- 
tunately it seems from Fig [3] that this optimum is quite 
broad. We expect that these basic results will be true 
more generally for problems in which we have to learn 
the parameters of probabilistic models to provide the best 
match to a large body of data. In the specific context of 
Ising models for networks of real neurons, the experimen- 
tal state of the art is now providing data on populations 
of neurons which arc sufficiently large that these issues 



of algorithmic efficiency become crucial. 
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